Abstract. This paper discusses the effect of hubness in zero-shot learning, when
ridge regression is used to find a mapping between the example space and the
label space. Contrary to the existing approach, which attempts to find a mapping 
from the example space to the label space, we show that mapping labels into the 
example space is desirable to suppress the emergence of hubs in the subsequent 
nearest neighbor search step. Assuming a simple data model, we prove that the 
proposed approach indeed reduces hubness. This was verified empirically on the
tasks of bilingual lexicon extraction and image labeling: hubness was reduced
in both tasks and the accuracy improved accordingly.


1 Introduction 

1.1 Background 

In recent years, zero-shot learning (ZSL) [10,14,15,22] has been an active research topic
in machine learning, computer vision, and natural language processing. Many practical 
applications can be formulated as a ZSL task: drug discovery ca, bilingual lexicon 
extraction [7,8,20], and image labeling [2,11,21,22,25], to name a few. Cross-lingual
information retrieval [28] can also be viewed as a ZSL task.

ZSL can be regarded as a type of (multi-class) classification problem, in the sense 
that the classifier is given a set of known example-class label pairs (training set), with 
the goal to predict the unknown labels of new examples (test set). However, ZSL differs 
from the standard classification in that the labels for the test examples are not present 
in the training set. In standard settings, the classifier chooses, for each test example, a 
label among those observed in the training set, but this is not the case in ZSL. Moreover, 
the number of class labels can be huge in ZSL; indeed, in bilingual lexicon extraction, 
labels correspond to possible translation words, which can range over the entire vocabulary
of the target language. 

Obviously, such a task would be intractable without further assumptions. Labels are
thus assumed to be embedded in a metric space (label space), and their distance (or
similarity) can be measured in this space. Such a label space can be built with the help
of background knowledge or external resources; in image labeling tasks, for example,
labels correspond to annotation keywords, which can be readily represented as vectors
in a Euclidean space, either by using corpus statistics in a standard way, or by using 
the more recent techniques for learning word representations, such as the continuous 
bag-of-words or skip-gram models IISI. 

After a label space is established, one natural approach would be to use a regression 
technique on the training set to obtain a mapping function from the example space to the 
label space. This function could then be used for mapping unlabeled examples into the 
label space, where nearest neighbor search is carried out to find the label closest to the
mapped example. Finally, this label would be output as the prediction for the example. 

To find the mapping function, some researchers use the standard linear ridge regression
[7,8,20,22], whereas others use neural networks [11,21,25].

In the machine learning community, meanwhile, the hubness phenomenon is
attracting attention as a new type of “curse of dimensionality.” This phenomenon
concerns nearest neighbor methods in high-dimensional space, and states that a
small number of objects in the dataset, or hubs, may occur as the nearest neighbor of
many objects. The emergence of these hubs diminishes the utility of nearest neighbor
search, because the list of nearest neighbors often contains the same hub objects
regardless of the query object for which the list is computed.

1.2 Research Objective and Contributions 

In this paper, we show that the interaction between the regression step in ZSL and the
subsequent nearest neighbor step has a non-negligible effect on the prediction accuracy.

In ZSL, examples and labels are represented as vectors in high-dimensional space, 
of which the dimensionality is typically a few hundred. As demonstrated by Dinu and 
Baroni [8] (see also Sect. 6), when ZSL is formulated as a problem of ridge regression
from examples to labels, “hub” labels emerge, which are simultaneously the nearest 
neighbors of many mapped examples. This has the consequence of incurring bias in the 
prediction, as these labels are output as the predicted labels for these examples. The
presence of hubs is not necessarily disadvantageous in standard classification settings;
there may be “good” hubs as well as “bad” hubs [23]. However, in typical ZSL tasks
in which the label set is fine-grained and huge, hubs are nearly always harmful to the 
prediction accuracy. 

Therefore, the objective of this study is to investigate ways to suppress hubs, and to 
improve the ZSL accuracy. Our contributions are as follows. 

1. We analyze the mechanism behind the emergence of hubs in ZSL, both with ridge 
regression and ordinary least squares. It is established that hubness occurs in ZSL 
not only because of high-dimensional space, but also because ridge regression has 
conventionally been used in ZSL in a way that promotes hubness. To be precise, 
the distributions of the mapped examples and the labels are different such that hubs
are likely to emerge.

^ Throughout the paper, we assume both the example and label spaces are Euclidean.

2. Drawing on the above analysis, we propose using ridge regression to map labels
into the space of examples. This approach is contrary to that followed in existing
work on ZSL, in which examples are mapped into label space. Our proposal is
therefore to reverse the mapping direction.

As shown in Sect. 6, our proposed approach outperformed the existing approach in
an empirical evaluation using both synthetic and real data. 

3. In terms of contributions to the research on hubness, this paper is the first to provide
an in-depth analysis of the situation in which the query and data follow different
distributions, and to show that the variance of data matters to hubness. In particular,
in Sect. 3 we provide a proposition in which the degree of bias present in the
data, which causes hub formation, is expressed as a function of the data variance.
In Sect. 4, this proposition serves as the main tool for analyzing hubness in ZSL.


2 Zero-Shot Learning as a Regression Problem 

Let $\mathcal{X}$ be a set of examples, and $\mathcal{Y}$ be a set of class labels. In ZSL, not only examples but
also labels are assumed to be vectors. For this reason, examples are sometimes referred
to as source objects, and labels as target objects. In the subsequent sections of this
paper, we mostly follow this terminology when referring to the members of $\mathcal{X}$ and $\mathcal{Y}$.

Let $\mathcal{X} \subset \mathbb{R}^{c}$ and $\mathcal{Y} \subset \mathbb{R}^{d}$. These spaces, $\mathbb{R}^{c}$ and $\mathbb{R}^{d}$, are called source space and
target space, respectively. Although $\mathcal{X}$ can be the entire space $\mathbb{R}^{c}$, $\mathcal{Y}$ is usually a finite
set of points in $\mathbb{R}^{d}$, even though its size may be enormous in some problems.

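Let $\mathcal{X}_{\mathrm{train}} = \{\mathbf{x}_i \mid i = 1, \ldots, n\}$ be the training examples (training source objects),
and $\mathcal{Y}_{\mathrm{train}} = \{\mathbf{y}_i \mid i = 1, \ldots, n\}$ be their labels (training target objects); i.e., the class label
of example $\mathbf{x}_i$ is $\mathbf{y}_i$, for each $i = 1, \ldots, n$. In a standard classification setting, the labels
in the training set are equal to the entire set of labels; i.e., $\mathcal{Y}_{\mathrm{train}} = \mathcal{Y}$. In contrast, this
assumption is not made in ZSL, and $\mathcal{Y}_{\mathrm{train}}$ is a strict subset of $\mathcal{Y}$. Moreover, it is assumed
that the true class labels of test examples do not belong to $\mathcal{Y}_{\mathrm{train}}$; i.e., they belong to
$\mathcal{Y} \setminus \mathcal{Y}_{\mathrm{train}}$.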

In such a situation, it is difficult to find a function $f$ that maps $\mathbf{x} \in \mathcal{X}$ directly to
a label in $\mathcal{Y}$. Therefore, a popular (and also natural) approach is to learn a projection
$m: \mathbb{R}^{c} \to \mathbb{R}^{d}$, which can be done with a regression technique. With a projection function
$m$ at hand, the label of a new source object $\mathbf{x} \in \mathbb{R}^{c}$ is predicted to be the one closest to
the mapped point $m(\mathbf{x})$ in the target space. The prediction function $f$ is thus given by

$$f(\mathbf{x}) = \operatorname*{arg\,min}_{\mathbf{y} \in \mathcal{Y}} \|m(\mathbf{x}) - \mathbf{y}\|.$$

After a source object $\mathbf{x}$ is projected to $m(\mathbf{x})$, the task is reduced to that of nearest neighbor
search in the target space.
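The regress, map, and search pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: it assumes that the projection $m$ is a linear ridge regressor with the standard closed-form solution, that objects are stored as matrix columns, and the names `fit_ridge` and `predict_labels` are hypothetical.

```python
import numpy as np

def fit_ridge(A, B, lam):
    """Return M minimizing ||M A - B||_F^2 + lam * ||M||_F^2 (objects are columns)."""
    c = A.shape[0]
    # Closed form M = B A^T (A A^T + lam I)^{-1}; solve a linear system
    # instead of forming the inverse explicitly.
    return np.linalg.solve(A @ A.T + lam * np.eye(c), A @ B.T).T

def predict_labels(M, X_test, Y_cand):
    """f(x) = argmin_y ||m(x) - y|| over the candidate labels (columns of Y_cand)."""
    mapped = M @ X_test                               # d x n_test
    # Squared Euclidean distances, shape (n_test, n_cand).
    d2 = (np.sum(mapped**2, axis=0)[:, None]
          + np.sum(Y_cand**2, axis=0)[None, :]
          - 2.0 * mapped.T @ Y_cand)
    return np.argmin(d2, axis=1)                      # index of the nearest label
```

With `lam = 0` the linear system is solvable only when `A @ A.T` is invertible, so a small positive `lam` is the safer default in this sketch.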


3 Hubness Phenomenon and the Variance of Data 

The utility of nearest neighbor search would be significantly reduced if the same objects
were to appear consistently as the search result, irrespective of the query. Radovanović
et al. [23] showed that such objects, termed hubs, indeed occur in high-dimensional
space. Although this phenomenon may seem counter-intuitive, hubness is observed in a
variety of real datasets and distance/similarity measures used in combination [23,24,26].

The aim of this study is to analyze the hubness phenomenon in ZSL, which involves
nearest neighbor search in high-dimensional space as the last step. However, as a tool
for analyzing ZSL, the existing theory on hubness is inadequate, as it was mainly
developed for comparing the emergence of hubness in spaces of different dimensionalities.


In the analysis of ZSL in Sect. 4.2, we aim to compare two distributions in the same
space, but which differ in terms of variance. To this end, we first present a proposition
below, which is similar in spirit to the main theorem of Radovanović et al. [23, Theorem 1],
but which distinguishes the query and data distributions, and also expresses the
expected difference between the squared distances from queries to database objects in
terms of their variance.

The proposition is concerned with nearest neighbor search, in which $\mathbf{x}$ is a query,
and $\mathbf{y}_1$ and $\mathbf{y}_2$ are two objects in a dataset. In the context of ZSL as formulated in Sect. 2,
$\mathbf{x}$ represents the image of a source object in the target space (through the learned regression
function $m$), and $\mathbf{y}_1$ and $\mathbf{y}_2$ are target objects (labels) lying at different distances
from the origin. We are interested in which of $\mathbf{y}_1$ and $\mathbf{y}_2$ is more likely to be closer to
$\mathbf{x}$, when $\mathbf{x}$ is sampled from a distribution $\mathcal{X}$ with zero mean.

Let $\mathrm{E}[\cdot]$ and $\mathrm{Var}[\cdot]$ denote the expectation and variance, respectively, and let $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
be a multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.


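Proposition 1. Let $\mathbf{y} = [y_1, \ldots, y_d]^{\mathrm{T}}$ be a $d$-dimensional random vector, with components
$y_i$ ($i = 1, \ldots, d$) sampled i.i.d. from a normal distribution with zero mean and
variance $s^2$; i.e., $\mathbf{y} \sim \mathcal{Y}$, where $\mathcal{Y} = \mathcal{N}(\mathbf{0}, s^2\mathbf{I})$. Further let $\sigma = \sqrt{\mathrm{Var}_{\mathcal{Y}}[\|\mathbf{y}\|^2]}$ be the
standard deviation of the squared norm $\|\mathbf{y}\|^2$.

Consider two fixed samples $\mathbf{y}_1$ and $\mathbf{y}_2$ of random vector $\mathbf{y}$, such that the squared
norms of $\mathbf{y}_1$ and $\mathbf{y}_2$ are $\gamma\sigma$ apart. In other words,

$$\|\mathbf{y}_2\|^2 - \|\mathbf{y}_1\|^2 = \gamma\sigma.$$

Let $\mathbf{x}$ be a point sampled from a distribution $\mathcal{X}$ with zero mean. Then, the expected
difference $\Delta$ between the squared distances from $\mathbf{y}_1$ and $\mathbf{y}_2$ to $\mathbf{x}$, i.e.,

$$\Delta = \mathrm{E}_{\mathcal{X}}\left[\|\mathbf{x} - \mathbf{y}_2\|^2\right] - \mathrm{E}_{\mathcal{X}}\left[\|\mathbf{x} - \mathbf{y}_1\|^2\right], \tag{1}$$

is given by

$$\Delta = \sqrt{2}\,\gamma\, d^{1/2} s^2. \tag{2}$$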

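Proof. For $i = 1, 2$, the squared distance between a point $\mathbf{x}$ and $\mathbf{y}_i$ is given by

$$\|\mathbf{x} - \mathbf{y}_i\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}_i\|^2 - 2\mathbf{x}^{\mathrm{T}}\mathbf{y}_i,$$

and its expected value is

$$\mathrm{E}_{\mathcal{X}}\left[\|\mathbf{x} - \mathbf{y}_i\|^2\right] = \mathrm{E}_{\mathcal{X}}\left[\|\mathbf{x}\|^2\right] + \|\mathbf{y}_i\|^2 - 2\,\mathrm{E}_{\mathcal{X}}[\mathbf{x}]^{\mathrm{T}}\mathbf{y}_i = \mathrm{E}_{\mathcal{X}}\left[\|\mathbf{x}\|^2\right] + \|\mathbf{y}_i\|^2,$$

since $\mathrm{E}_{\mathcal{X}}[\mathbf{x}] = \mathbf{0}$ by assumption. Substituting this equality in (1) yields

$$\Delta = \overbrace{\left(\mathrm{E}_{\mathcal{X}}\left[\|\mathbf{x}\|^2\right] + \|\mathbf{y}_2\|^2\right)}^{\mathrm{E}_{\mathcal{X}}[\|\mathbf{x} - \mathbf{y}_2\|^2]} - \overbrace{\left(\mathrm{E}_{\mathcal{X}}\left[\|\mathbf{x}\|^2\right] + \|\mathbf{y}_1\|^2\right)}^{\mathrm{E}_{\mathcal{X}}[\|\mathbf{x} - \mathbf{y}_1\|^2]} = \|\mathbf{y}_2\|^2 - \|\mathbf{y}_1\|^2 = \gamma\sigma. \tag{3}$$

Now, it is well known that if a $d$-dimensional random vector $\mathbf{z}$ follows the multivariate
standard normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, then its squared norm $\|\mathbf{z}\|^2$ follows the
chi-squared distribution with $d$ degrees of freedom, and its variance is $2d$. Since $\mathbf{y} = s\mathbf{z}$,
the variance of the squared norm $\|\mathbf{y}\|^2$ is

$$\sigma^2 = \mathrm{Var}_{\mathcal{Y}}\left[\|\mathbf{y}\|^2\right] = \mathrm{Var}_{\mathcal{Z}}\left[s^2\|\mathbf{z}\|^2\right] = s^4\,\mathrm{Var}_{\mathcal{Z}}\left[\|\mathbf{z}\|^2\right] = 2ds^4. \tag{4}$$

From (3) and (4), we obtain $\Delta = \gamma\sigma = \sqrt{2}\,\gamma\, d^{1/2} s^2$. □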
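The two facts used in the proof can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: the function name `prop1_check`, the sample sizes, and the choice of a uniform query distribution (picked only to stress that the proof uses zero mean and not normality of the query) are all hypothetical.

```python
import numpy as np

def prop1_check(d=50, s=1.5, n=50_000, seed=0):
    """Monte Carlo check of Eq. (4) and Eq. (3) of Proposition 1."""
    rng = np.random.default_rng(seed)
    # Eq. (4): Var[||y||^2] should be close to 2 d s^4 for y ~ N(0, s^2 I).
    y = rng.normal(0.0, s, size=(n, d))
    var_emp = np.var(np.sum(y**2, axis=1))
    var_theory = 2 * d * s**4
    # Eq. (3): for two fixed data points and zero-mean queries, the expected
    # difference of squared distances equals ||y2||^2 - ||y1||^2.
    y1, y2 = rng.normal(0.0, s, size=(2, d))
    x = rng.uniform(-1.0, 1.0, size=(n, d))           # zero mean, not normal
    delta_emp = np.mean(np.sum((x - y2)**2, axis=1)
                        - np.sum((x - y1)**2, axis=1))
    delta_theory = np.sum(y2**2) - np.sum(y1**2)
    return var_emp, var_theory, delta_emp, delta_theory
```

The empirical and theoretical values should agree up to sampling noise, which shrinks as `n` grows.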

Note that in Proposition 1, the standard deviation $\sigma$ is used as a yardstick of measurement
to allow for comparison of “similarly” located object pairs across different
distributions; two object pairs in different distributions are regarded as similar if the objects
in each pair are $\gamma\sigma$ apart, where $\sigma$ is measured for the respective distribution but the
factor $\gamma$ is the same. This technique is due to Radovanović et al. [23].

Now, $\Delta$ represents the expected difference between the squared distances from $\mathbf{x}$ to
$\mathbf{y}_1$ and $\mathbf{y}_2$. Equation (2) shows that $\Delta$ increases with $\gamma$, the factor quantifying the amount
of difference $\|\mathbf{y}_2\|^2 - \|\mathbf{y}_1\|^2$. This suggests that a query object sampled from $\mathcal{X}$ is more
likely to be closer to object $\mathbf{y}_1$ than to $\mathbf{y}_2$, if $\|\mathbf{y}_1\|^2 < \|\mathbf{y}_2\|^2$; i.e., $\mathbf{y}_1$ is closer to the origin
than $\mathbf{y}_2$ is. Because this holds for any pair of objects $\mathbf{y}_1$ and $\mathbf{y}_2$ in the dataset, we can
conclude that the objects closest to the origin in the dataset tend to be hubs.

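Equation (2) also states the relationship between $\Delta$ and the component variance $s^2$
of distribution $\mathcal{Y}$, by which the following is implied: For a fixed query distribution $\mathcal{X}$,
if we have two distributions for $\mathbf{y}$, $\mathcal{Y}_1 = \mathcal{N}(\mathbf{0}, s_1^2\mathbf{I})$ and $\mathcal{Y}_2 = \mathcal{N}(\mathbf{0}, s_2^2\mathbf{I})$ with $s_1^2 < s_2^2$,
it is preferable to choose $\mathcal{Y}_1$, i.e., the distribution with a smaller $s^2$, when attempting
to reduce hubness. Indeed, assuming the independence of $\mathcal{X}$ and $\mathcal{Y}$, we can show that
the influence of $\Delta$ relative to the expected squared distance from $\mathbf{x}$ to $\mathbf{y}$ (which is also
subject to whether $\mathbf{y} \sim \mathcal{Y}_1$ or $\mathcal{Y}_2$) is weaker for $\mathcal{Y}_1$ than for $\mathcal{Y}_2$, i.e.,

$$\frac{\Delta(\gamma, d, s_1)}{\mathrm{E}_{\mathcal{X}\mathcal{Y}_1}\left[\|\mathbf{x} - \mathbf{y}\|^2\right]} < \frac{\Delta(\gamma, d, s_2)}{\mathrm{E}_{\mathcal{X}\mathcal{Y}_2}\left[\|\mathbf{x} - \mathbf{y}\|^2\right]},$$

where we wrote $\Delta$ explicitly as a function of $\gamma$, $d$, and $s$.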
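One way to see why the ratio grows with $s$ is the short derivation below. It is a sketch under the stated independence assumption; the shorthand $v = \mathrm{E}_{\mathcal{X}}[\|\mathbf{x}\|^2]$ is introduced here for brevity and does not appear in the original text.

```latex
\begin{align*}
\mathrm{E}_{\mathcal{X}\mathcal{Y}}\!\left[\|\mathbf{x}-\mathbf{y}\|^2\right]
  &= \mathrm{E}_{\mathcal{X}}\!\left[\|\mathbf{x}\|^2\right]
   + \mathrm{E}_{\mathcal{Y}}\!\left[\|\mathbf{y}\|^2\right]
   - 2\,\mathrm{E}_{\mathcal{X}}[\mathbf{x}]^{\mathrm{T}}\,\mathrm{E}_{\mathcal{Y}}[\mathbf{y}]
   = v + d s^2, \\
\frac{\Delta(\gamma,d,s)}{\mathrm{E}_{\mathcal{X}\mathcal{Y}}\!\left[\|\mathbf{x}-\mathbf{y}\|^2\right]}
  &= \frac{\sqrt{2d}\,\gamma\, s^2}{v + d s^2}
   = \frac{\sqrt{2d}\,\gamma}{v/s^2 + d}.
\end{align*}
```

For $\gamma > 0$ and $v > 0$, the last expression is strictly increasing in $s$, because the denominator $v/s^2 + d$ decreases as $s$ grows. Hence the ratio is smaller for $s_1$ than for $s_2$ whenever $s_1 < s_2$.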

4 Hubness in Regression-Based Zero-Shot Learning 

In this section, we analyze the emergence of hubs in the nearest neighbor step of ZSL. 
Through the analysis, it is shown that hubs are promoted by the use of ridge regression 
in the existing formulation of ZSL, i.e., mapping source objects (examples) into the 
target (label) space. 

As a solution, we propose using ridge regression in a direction opposite to that in
existing work. That is, we project target objects into the space of source objects, and carry
out nearest neighbor search in the source space. Our argument for this approach consists
of three steps.

1. We first show in Sect. 4.1 that, with ridge regression (and ordinary least squares
as well), mapped observation data tend to lie closer to the origin than the target
responses do. Because the existing work formulates ZSL as a regression problem
that projects source objects into the target space, this means that the norm of the
projected source objects tends to be smaller than that of target objects.

2. By combining the above result with the discussion of Sect. 3, we then argue that
placing source objects closer to the origin is not ideal from the perspective of reducing
hubness. On the contrary, placing target objects closer to the origin, as attained
with the proposed approach, is more desirable (Sect. 4.2).

3. In Sect. 4.3, we present a simple additional argument against placing source objects
closer to the origin; if the data is unimodal, such a configuration increases the possibility
of another target object falling closer to the source object. This argument
diverges from the discussion on hubness, but again justifies the proposed approach.

4.1 Shrinkage of Projected Objects 

We first prove that ridge regression tends to map observation data closer to the origin
of the space. This tendency may be easily observed in ridge regression, for which
the penalty term shrinks the estimated coefficients towards zero. However, the above
tendency is also inherent in ordinary least squares.

Let $\|\cdot\|_{\mathrm{F}}$ and $\|\cdot\|_2$ respectively denote the Frobenius norm and the 2-norm of matrices.

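Proposition 2. Let $\mathbf{M} \in \mathbb{R}^{d \times c}$ be the solution for ridge regression with an observation
matrix $\mathbf{A} \in \mathbb{R}^{c \times n}$ and a response matrix $\mathbf{B} \in \mathbb{R}^{d \times n}$; i.e.,

$$\mathbf{M} = \operatorname*{arg\,min}_{\mathbf{M}} \left( \|\mathbf{M}\mathbf{A} - \mathbf{B}\|_{\mathrm{F}}^2 + \lambda\|\mathbf{M}\|_{\mathrm{F}}^2 \right), \tag{5}$$

where $\lambda \ge 0$ is a hyperparameter. Then, we have $\|\mathbf{M}\mathbf{A}\|_2 \le \|\mathbf{B}\|_2$.

Proof (Sketch). It is well known that $\mathbf{M} = \mathbf{B}\mathbf{A}^{\mathrm{T}}(\mathbf{A}\mathbf{A}^{\mathrm{T}} + \lambda\mathbf{I})^{-1}$. Thus we have

$$\|\mathbf{M}\mathbf{A}\|_2 = \|\mathbf{B}\mathbf{A}^{\mathrm{T}}(\mathbf{A}\mathbf{A}^{\mathrm{T}} + \lambda\mathbf{I})^{-1}\mathbf{A}\|_2 \le \|\mathbf{B}\|_2\, \|\mathbf{A}^{\mathrm{T}}(\mathbf{A}\mathbf{A}^{\mathrm{T}} + \lambda\mathbf{I})^{-1}\mathbf{A}\|_2. \tag{6}$$

Let $\sigma$ be the largest singular value of $\mathbf{A}$. It can be shown that

$$\|\mathbf{A}^{\mathrm{T}}(\mathbf{A}\mathbf{A}^{\mathrm{T}} + \lambda\mathbf{I})^{-1}\mathbf{A}\|_2 = \frac{\sigma^2}{\sigma^2 + \lambda} \le 1.$$

Substituting this inequality in (6) establishes the proposition. □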
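The step marked "it can be shown" follows from a singular value decomposition. The derivation below is a sketch that assumes $\lambda > 0$ (or that $\mathbf{A}\mathbf{A}^{\mathrm{T}}$ is invertible when $\lambda = 0$); the factor names $\mathbf{U}$, $\mathbf{S}$, $\mathbf{V}$ and the ordered singular values $\sigma_1 \ge \sigma_2 \ge \cdots$ are notation introduced here, with $\sigma_1$ playing the role of $\sigma$ in the proof.

```latex
\begin{align*}
\mathbf{A} &= \mathbf{U}\mathbf{S}\mathbf{V}^{\mathrm{T}}, \qquad
\mathbf{A}\mathbf{A}^{\mathrm{T}} + \lambda\mathbf{I}
  = \mathbf{U}\left(\mathbf{S}\mathbf{S}^{\mathrm{T}} + \lambda\mathbf{I}\right)\mathbf{U}^{\mathrm{T}}, \\
\mathbf{A}^{\mathrm{T}}\left(\mathbf{A}\mathbf{A}^{\mathrm{T}} + \lambda\mathbf{I}\right)^{-1}\mathbf{A}
  &= \mathbf{V}\,\mathbf{S}^{\mathrm{T}}\left(\mathbf{S}\mathbf{S}^{\mathrm{T}} + \lambda\mathbf{I}\right)^{-1}\mathbf{S}\,\mathbf{V}^{\mathrm{T}}
   = \mathbf{V}\,\operatorname{diag}\!\left(\frac{\sigma_1^2}{\sigma_1^2+\lambda},\,
     \frac{\sigma_2^2}{\sigma_2^2+\lambda},\,\ldots,\,0,\ldots,0\right)\mathbf{V}^{\mathrm{T}}.
\end{align*}
```

The matrix is symmetric positive semidefinite, so its 2-norm is its largest eigenvalue. Since $t \mapsto t/(t+\lambda)$ is nondecreasing for $t \ge 0$, that eigenvalue is $\sigma_1^2/(\sigma_1^2+\lambda)$, which is at most 1.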

Recall that if the data is centered, the matrix 2-norm can be interpreted as an indicator
of the variance of data along its principal axis. Proposition 2 thus indicates that the
variance along the principal axis of the mapped observations $\mathbf{M}\mathbf{A}$ tends to be smaller
than that of responses $\mathbf{B}$.

Furthermore, this tendency even persists in the ordinary least squares with no penalty
term (i.e., $\lambda = 0$), since $\|\mathbf{M}\mathbf{A}\|_2 \le \|\mathbf{B}\|_2$ still holds in this case; note that $\mathbf{A}^{\mathrm{T}}(\mathbf{A}\mathbf{A}^{\mathrm{T}})^{-1}\mathbf{A}$
is an orthogonal projection and its 2-norm is 1, but the inequality in (6) holds regardless.
This tendency therefore cannot be completely eliminated by simply decreasing the
ridge parameter $\lambda$ towards zero.
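The shrinkage can also be observed numerically, including the ordinary least squares case. The following sketch uses random Gaussian matrices as stand-ins for real data; the helper name `shrinkage_ratio` and the matrix sizes are illustrative assumptions only.

```python
import numpy as np

def shrinkage_ratio(A, B, lam):
    """Return ||M A||_2 / ||B||_2 for the ridge solution M of Eq. (5)."""
    c = A.shape[0]
    # M = B A^T (A A^T + lam I)^{-1}, computed via a linear solve.
    M = np.linalg.solve(A @ A.T + lam * np.eye(c), A @ B.T).T
    # ord=2 on a matrix gives the spectral norm (largest singular value).
    return np.linalg.norm(M @ A, 2) / np.linalg.norm(B, 2)

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 200))    # 20-dimensional observations, 200 objects
B = rng.normal(size=(10, 200))    # 10-dimensional responses
# lam = 0.0 is ordinary least squares; A A^T is invertible for this A.
ratios = {lam: shrinkage_ratio(A, B, lam) for lam in (0.0, 1.0, 100.0)}
```

Every ratio stays at or below 1, and the ratio does not increase as `lam` grows, which matches the statement that the penalty strengthens a shrinkage already present without it.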

In existing work on ZSL, $\mathbf{A}$ represents the (training) source objects $\mathbf{X} = [\mathbf{x}_1 \cdots \mathbf{x}_n] \in \mathbb{R}^{c \times n}$,
to be mapped into the space of target objects (by projection matrix $\mathbf{M}$); and $\mathbf{B}$ is
the matrix of labels for the training objects, i.e., $\mathbf{B} = \mathbf{Y} = [\mathbf{y}_1 \cdots \mathbf{y}_n] \in \mathbb{R}^{d \times n}$. Although
Proposition 2 is thus only concerned with the training set, it suggests that the source
objects at the time of testing, which are not in $\mathbf{X}$, are also likely to be mapped closer to
the origin of the target space than many of the target objects in $\mathbf{Y}$.


4.2 Influence of Shrinkage on Nearest Neighbor Search 


We learned in Sect. 4.1 that ridge regression (and ordinary least squares) shrink the 
mapped observation data towards the origin of the space, relative to the response. Thus, 
in existing work on ZSL in which source objects X are projected to the space of target 
objects Y, the norm of the mapped source objects is likely to be smaller than that of the 
target objects. 


The proposed approach, which was described in the beginning of Sect. 4, follows
the opposite direction; target objects Y are projected to the space of source objects X. 
Thus, in this case, the norm of the mapped target objects is expected to be smaller than 
that of the source objects. 


The question now is which of these configurations is preferable for the subsequent 
nearest neighbor step, and we provide an answer under the following assumptions: (i) 
The source space and the target space are of equal dimensions; (ii) the source and target 
objects are isotropically normally distributed and independent; and (iii) the projected 
data is also isotropically normally distributed, except that the variance has shrunk. 

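Let $\mathcal{D}_1 = \mathcal{N}(\mathbf{0}, s_1^2\mathbf{I})$ and $\mathcal{D}_2 = \mathcal{N}(\mathbf{0}, s_2^2\mathbf{I})$ be two multivariate normal distributions,
with $s_1^2 < s_2^2$. We compare two configurations of source object $\mathbf{x}$ and target objects $\mathbf{y}$:
(a) the one in which $\mathbf{x} \sim \mathcal{D}_1$ and $\mathbf{y} \sim \mathcal{D}_2$, and (b) the one in which $\mathbf{x}' \sim \mathcal{D}_2$ and $\mathbf{y}' \sim \mathcal{D}_1$;
here, the primes in (b) were added to distinguish the variables in the two
configurations.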


These two configurations are intended to model situations in (a) existing work and 
(b) our proposal. In configuration (a), x is shorter in expectation than y, and therefore 
this approximates the situation that arises from existing work. Configuration (b) repre¬ 
sents the opposite situation, and corresponds to our proposal in which y is the projected 
vector and thus is shorter in expectation than x. 


Now, we aim to verify whether the two configurations differ in terms of the likelihood
of hubs emerging, using Proposition 1. First, we scale the entire space of configuration
(b) by $s_1/s_2$, or equivalently, we consider transformation of the variables
by $\mathbf{x}'' = (s_1/s_2)\mathbf{x}'$ and $\mathbf{y}'' = (s_1/s_2)\mathbf{y}'$. Note that because the two variables are scaled
equally, this change of variables preserves the nearest neighbor relations among the
samples. See Fig. 1 for an illustration of the relationship among $\mathbf{x}$, $\mathbf{y}$, $\mathbf{x}'$, $\mathbf{y}'$, $\mathbf{x}''$, and $\mathbf{y}''$.



[Figure 1, three panels. Left: Configuration (a): (x, y). Center: (x, y) and (x'', y''). Right: Configuration (b): (x', y').]

Fig. 1. Schematic illustration for Sect. 4.2 in two-dimensional space. The left and the right panels
depict configurations (a) and (b), respectively, with the center panel showing both configuration
(a) and the scaled version of configuration (b) in the same space. A circle represents a distribution,
with its radius indicating the standard deviation. The radius of the circles for $\mathbf{x}$ (on the left panel)
and $\mathbf{y}'$ (right panel) is $s_1$, whereas that of the circles for $\mathbf{y}$ (left panel) and $\mathbf{x}'$ (right panel) is $s_2$,
with $s_1 < s_2$. Circles $\mathbf{x}''$ and $\mathbf{y}''$ are the scaled versions of $\mathbf{x}'$ and $\mathbf{y}'$ such that the standard deviation
(radius) of $\mathbf{x}''$ is equal to that of $\mathbf{x}$, which makes the standard deviation of $\mathbf{y}''$ equal to $s_3 = s_1^2/s_2$.


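Let $\{x'_i\}$ and $\{y'_i\}$ be the components of $\mathbf{x}'$ and $\mathbf{y}'$, respectively, and let $\{x''_i\}$ and
$\{y''_i\}$ be those for $\mathbf{x}''$ and $\mathbf{y}''$. Then we have

$$\mathrm{Var}[x''_i] = \mathrm{Var}\!\left[\frac{s_1}{s_2}\,x'_i\right] = \left(\frac{s_1}{s_2}\right)^{2}\mathrm{Var}[x'_i] = s_1^2,$$

$$\mathrm{Var}[y''_i] = \mathrm{Var}\!\left[\frac{s_1}{s_2}\,y'_i\right] = \left(\frac{s_1}{s_2}\right)^{2}\mathrm{Var}[y'_i] = \frac{s_1^4}{s_2^2}.$$

Thus, $\mathbf{x}''$ follows $\mathcal{N}(\mathbf{0}, s_1^2\mathbf{I})$, and $\mathbf{y}''$ follows $\mathcal{N}(\mathbf{0}, (s_1^4/s_2^2)\mathbf{I})$. Since both $\mathbf{x}$ in configuration
(a) and $\mathbf{x}''$ above follow the same distribution, it now becomes possible to compare the
properties of $\mathbf{y}$ and $\mathbf{y}''$ in light of the discussion at the end of Sect. 3. In order to reduce
hubness, the distribution with a smaller variance is preferred to the one with a larger
variance, for a fixed distribution of source $\mathbf{x}$ (or equivalently, $\mathbf{x}''$).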

It follows that y" is preferable to y, because the former has a smaller variance. As 
mentioned above, the nearest neighbor relation between the scaled variables, y" against 
x" (or equivalently x), is identical to y' against x' in configuration (b). Therefore, we 
conclude that configuration (b) is preferable to configuration (a), in the sense that the 
former is more likely to suppress hubs. 

Finally, recall that the preferred configuration (b) models the situation of our proposed
approach, which is to map target objects into the space of source objects.
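The comparison of the two configurations can be simulated directly under assumptions (i) to (iii). The sketch below is illustrative only: the dimension, sample size, and standard deviations are arbitrary choices, queries and data are drawn independently (so the pairing between a source object and its label is ignored), and hubness is summarized by the skewness of the count of how often each data point appears in a top-$k$ list. The function name `n_k_skewness` is hypothetical.

```python
import numpy as np

def n_k_skewness(queries, data, k=10):
    """Skewness of the number of times each data point is among the k nearest of a query."""
    d2 = (np.sum(queries**2, axis=1)[:, None]
          + np.sum(data**2, axis=1)[None, :]
          - 2.0 * queries @ data.T)
    # Indices of the k smallest distances per query (order within the k is irrelevant).
    topk = np.argpartition(d2, k - 1, axis=1)[:, :k]
    counts = np.bincount(topk.ravel(), minlength=data.shape[0]).astype(float)
    centered = counts - counts.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

rng = np.random.default_rng(0)
d, n, s1, s2 = 30, 2000, 1.0, 3.0
# Configuration (a): queries x ~ N(0, s1^2 I), data y ~ N(0, s2^2 I).
skew_a = n_k_skewness(rng.normal(0, s1, (n, d)), rng.normal(0, s2, (n, d)))
# Configuration (b): queries x' ~ N(0, s2^2 I), data y' ~ N(0, s1^2 I).
skew_b = n_k_skewness(rng.normal(0, s2, (n, d)), rng.normal(0, s1, (n, d)))
```

Under these settings `skew_a` is expected to exceed `skew_b`, in line with the conclusion that configuration (b) is less prone to hubs.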


4.3 Additional Argument for Placing Target Objects Closer to the Origin 

By assuming a unimodal data distribution of which the probability density function 
(pdf) p(z) is decreasing in ||z||, we are able to present the following proposition which 
also advocates placing the source objects outside the target objects, and not the other 
way around. 
























Fig. 2. Illustration of the situation considered in Proposition 3. Here, it is assumed that $\|\mathbf{x}_1\| <
\|\mathbf{x}_2\|$ and $\|\mathbf{y} - \mathbf{x}_1\| = \|\mathbf{y} - \mathbf{x}_2\|$. The intensity of the background shading represents the values of
the pdf of a bivariate standard normal distribution, from which $\mathbf{y}$ and other objects (not depicted
in the figure) in set $\mathcal{Y}$ are sampled. The probability mass inside the circle centered at $\mathbf{x}_1$ is greater
than that centered at $\mathbf{x}_2$, as the intensity of the shading inside the two circles shows.


Proposition 3 is concerned with the placement of a source object $\mathbf{x}$ at a fixed distance
$r$ from its target object $\mathbf{y}$, for which we have two alternatives $\mathbf{x}_1$ and $\mathbf{x}_2$, located at
different distances from the origin of the space.

Proposition 3. Consider a finite set $\mathcal{Y}$ of objects (i.e., points) in a Euclidean space,
sampled i.i.d. from a distribution whose pdf $p(\mathbf{z})$ is a decreasing function of $\|\mathbf{z}\|$. Let
$\mathbf{y} \in \mathcal{Y}$ be an object in the set, and let $r > 0$. Further let $\mathbf{x}_1$ and $\mathbf{x}_2$ be two objects at a
distance $r$ apart from $\mathbf{y}$. If $\|\mathbf{x}_1\| < \|\mathbf{x}_2\|$, then the probability that $\mathbf{y}$ is the closest object
in $\mathcal{Y}$ to $\mathbf{x}_2$ is greater than that for $\mathbf{x}_1$.

Proof (Sketch). For $i = 1, 2$, if another object in $\mathcal{Y}$ appears within distance $r$ of $\mathbf{x}_i$, then
$\mathbf{y}$ is not the nearest neighbor of $\mathbf{x}_i$. Thus, we aim to prove that this probability for $\mathbf{x}_2$ is
smaller than that for $\mathbf{x}_1$. Since objects in $\mathcal{Y}$ are sampled i.i.d., it suffices to prove

$$\int_{\mathbf{z} \in V_2} dp(\mathbf{z}) < \int_{\mathbf{z} \in V_1} dp(\mathbf{z}), \tag{7}$$

where $V_i$ ($i = 1, 2$) denote the balls centered at $\mathbf{x}_i$ with radius $r$. However, (7) obviously
holds because the balls $V_1$ and $V_2$ have the same radii, $p(\mathbf{z})$ is a decreasing function of
$\|\mathbf{z}\|$, and $\|\mathbf{x}_1\| < \|\mathbf{x}_2\|$. See Figure 2 for an illustration with a bivariate standard normal
distribution in two-dimensional space. □
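Inequality (7) can be illustrated by estimating the two probability masses with samples from a bivariate standard normal distribution, as in Fig. 2. The concrete points, the radius, and the helper name `ball_mass` below are assumptions chosen for illustration.

```python
import numpy as np

def ball_mass(center, r, samples):
    """Fraction of samples that fall inside the ball of radius r around center."""
    return np.mean(np.sum((samples - center)**2, axis=1) < r**2)

rng = np.random.default_rng(0)
z = rng.standard_normal((200_000, 2))     # draws from p(z), bivariate standard normal
y = np.array([1.0, 0.0])                  # target object
r = 1.0
x1 = y + r * np.array([-1.0, 0.0])        # (0, 0): at distance r from y, at the origin
x2 = y + r * np.array([1.0, 0.0])         # (2, 0): at distance r from y, farther out
mass1, mass2 = ball_mass(x1, r, z), ball_mass(x2, r, z)
```

Here `mass1` estimates the mass of the unit disc around the origin, which is $1 - e^{-1/2} \approx 0.393$ in closed form, and `mass2` is noticeably smaller. If the other $n - 1$ objects are drawn i.i.d., the chance that none of them falls in the ball is $(1 - \text{mass})^{n-1}$, which is therefore larger for the point farther from the origin.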

In the context of existing work on ZSL, which uses ridge regression to map source
objects into the space of target objects, $\mathbf{x}$ can be regarded as a mapped source object, and
$\mathbf{y}$ as its target object. Proposition 3 implies that if we want the target object $\mathbf{y}$ to be the
nearest neighbor of a source object $\mathbf{x}$, then $\mathbf{x}$ should rather be placed farther than $\mathbf{y}$ from the
origin, but this idea is not present in the objective function (5) for ridge regression; the
first term of the objective allocates the same amount of penalty for $\mathbf{x}_1$ and $\mathbf{x}_2$, as they are
equally distant from the target $\mathbf{y}$. On the contrary, the ridge regression actually promotes
placement of the mapped source object $\mathbf{x}$ closer to the origin, as stated in Proposition 2.


4.4 Summary of the Proposed Approach 

Drawing on the analysis presented in Sections 4.1-4.3, we propose performing regression
that maps target objects into the space of source objects, and carrying out nearest neighbor
search in the source space. This opposes the approach followed in existing work
on regression-based ZSL [7,8,16,20,22], which maps source objects into the space of
target objects.

In the proposed approach, matrix $\mathbf{B}$ in Proposition 2 represents the source objects
$\mathbf{X}$, and $\mathbf{A}$ represents the target objects $\mathbf{Y}$. Therefore, $\|\mathbf{M}\mathbf{A}\|_2 \le \|\mathbf{B}\|_2$ means $\|\mathbf{M}\mathbf{Y}\|_2 \le
\|\mathbf{X}\|_2$, i.e., the mapped target objects tend to be placed closer than the corresponding
source objects to the origin.

Admittedly, the above argument for our proposal relies on strong assumptions on
data distributions (such as normality), which do not apply to real data. However, the
effectiveness of our proposal is verified empirically in Sect. 6 by using real data.
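The reversed mapping can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are assumed to be centered and stored column-wise, the ridge solution is computed in closed form, and the names `fit_ridge` and `predict_reverse` are hypothetical.

```python
import numpy as np

def fit_ridge(A, B, lam):
    """Return M minimizing ||M A - B||_F^2 + lam * ||M||_F^2 (objects are columns)."""
    k = A.shape[0]
    return np.linalg.solve(A @ A.T + lam * np.eye(k), A @ B.T).T

def predict_reverse(X_train, Y_train, X_test, Y_cand, lam):
    """Map candidate labels into the source space, then search for neighbors there."""
    # Observation matrix A = Y_train, response matrix B = X_train.
    M = fit_ridge(Y_train, X_train, lam)
    mapped = M @ Y_cand                               # images of labels in source space
    d2 = (np.sum(X_test**2, axis=0)[:, None]
          + np.sum(mapped**2, axis=0)[None, :]
          - 2.0 * X_test.T @ mapped)
    return np.argmin(d2, axis=1)                      # nearest mapped label per source object
```

The only changes relative to the conventional direction are the swapped arguments in `fit_ridge` and the fact that the test source objects are left unmapped while the candidate labels are projected.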


5 Related Work 

The first use of ridge regression in ZSL can be found in the work of Palatucci et al. ina. 
Ridge regression has since been one of the standard approaches to ZSL, especially for 
natural language processing tasks: phrase generation 13 and bilingual lexicon extraction
[7,8,20]. More recently, neural networks have been used for learning non-linear
mapping mna. All of the regression-based methods listed above, including those 
based on neural networks, map source objects into the target space. 

ZSL can also be formulated as a problem of canonical correlation analysis (CCA). 
Hardoon et al. ca used CCA and kernelized CCA for image labeling. Lazaridou
et al. [16] compared ridge regression, CCA, singular value decomposition, and neural
networks in image labeling. In our experiments (Sect. 6), we use CCA as one of the
baseline methods for comparison.

Dinu and Baroni [8] reported the hubness phenomenon in ZSL. They proposed two
reweighting techniques to reduce hubness in ZSL, which are applicable to cosine similarity.
Tomašev et al. proposed hubness-based instance weighting schemes for
CCA. These schemes were applied to classification problems in which multiple instances
(vectors) in the target space have the same class label. This setting is different
from the one assumed in this paper (see Sect. 2), i.e., we assume that a class label is
represented by a single target vector.

Structured output learning lH addresses a problem setting similar to ZSL, except 
that the target objects typically have complex structure, and thus the cost of embedding 
objects in a vector space is prohibitive. Kernel dependency estimation [29] is an approach
that uses kernel PCA and regression to avoid this issue. In this context, nearest
neighbor search in the target space reduces to the pre-image problem ifTSl in the implicit 
space induced by kernels. 


6 Experiments 

We evaluated the proposed approach with both synthetic and real datasets. In particular, 
it was applied to two real ZSL tasks: bilingual lexicon extraction and image labeling. 


^ Perhaps because of this difference, the method of Tomašev et al. did not perform well in our experiment,
and we do not report its result in Sect. 6.

The main objective of the following experiments is to verify whether our proposed
approach is capable of suppressing hub formation and outperforming the existing approach,
as claimed in Sect. 4.


6.1 Experimental Setups 

Compared Methods. The following methods were compared. 

- $\mathrm{Ridge}_{X \to Y}$: Linear ridge regression mapping source objects $\mathbf{X}$ into the space of
target objects $\mathbf{Y}$. This is how ridge regression was used in the existing work on ZSL
[7,8,16,20,22].

- $\mathrm{Ridge}_{Y \to X}$: Linear ridge regression mapping target objects $\mathbf{Y}$ into the source space.
This is the proposed approach (Sect. 4.4).

- CCA: Canonical correlation analysis (CCA) for ZSL ifT^ . We used the code available
from http://www.davidroihardoon.com/Professional/Code.html

We calibrated the hyperparameters, i.e., the regularization parameter in ridge regression
and the dimensionality of common feature space in CCA, by cross validation
on the training set.

After ridge regression or CCA is applied, both $\mathbf{X}$ and $\mathbf{Y}$ (or their images) are located
in the same space, wherein we find the closest target object for a given source
object as measured by the Euclidean distance. In addition to the Euclidean distance, we
also tested the non-iterative contextual dissimilarity measure (NICDM) [13] in combination
with $\mathrm{Ridge}_{X \to Y}$ and CCA. NICDM adjusts the Euclidean distance to make the
neighborhood relations more symmetrical, and is known to effectively reduce hubness
in non-ZSL context [24].

All data were centered before application of regression and CCA, as usual with 
these methods. 


Evaluation Criteria. The compared methods were evaluated in two respects: (i) the 
correctness of their prediction, and (ii) the degree of hubness in nearest neighbor search. 


Measures of Prediction Correctness. In all our experiments, ZSL was formulated as a
ranking task; given a source object, all the target objects were ranked by their likelihood
for the source object. As the main evaluation criterion, we used the mean average
precision (MAP) lEl, which is one of the standard performance metrics for ranking
methods. Note that the synthetic and the image labeling experiments are single-label
problems, for which MAP is equal to the mean reciprocal rank iflTlI . We also report the
top-$k$ accuracy (Acc$_k$) for $k = 1$ and 10, which is the percentage of source objects for
which the correct target objects are present in their $k$ nearest neighbors.
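For single-label problems these measures reduce to simple functions of the rank of the gold target: with one relevant item at rank $r$, the average precision is $1/r$, so MAP coincides with the mean reciprocal rank. The sketch below assumes a precomputed distance matrix and breaks ties in favor of the gold target; the function names and the toy matrix are hypothetical, and accuracies are returned as fractions rather than percentages.

```python
import numpy as np

def gold_ranks(dist, gold):
    """1-based rank of the gold target for each source object.

    dist: (n_sources, n_targets) distances; gold: index of the correct target per source.
    """
    gold_dist = dist[np.arange(dist.shape[0]), gold]
    # Rank = 1 + number of targets strictly closer than the gold one.
    return 1 + np.sum(dist < gold_dist[:, None], axis=1)

def mean_reciprocal_rank(ranks):
    return float(np.mean(1.0 / ranks))

def top_k_accuracy(ranks, k):
    return float(np.mean(ranks <= k))

# Toy example: the gold targets end up at ranks 1, 2, and 3.
dist = np.array([[0.1, 0.5, 0.9],
                 [0.7, 0.2, 0.4],
                 [0.3, 0.8, 0.6]])
ranks = gold_ranks(dist, [0, 2, 1])
```

For this toy input the mean reciprocal rank is $(1 + 1/2 + 1/3)/3$, the top-1 accuracy is $1/3$, and the top-2 accuracy is $2/3$.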

^ In image labeling (only), we report the top-1 accuracy (Acc$_1$) macro-averaged over classes,
to allow direct comparison with published results. Note also that Acc$_k$ with a larger $k$ would
not be an informative metric for the image labeling task, which only has 10 test labels.








Measure of Hubness. To measure the degree of hubness, we used the skewness of the
(empirical) $N_k$ distribution, following the approach in the literature [23,24,26,27]. The
$N_k$ distribution is the distribution of the number $N_k(i)$ of times each target object $i$ is
found in the top $k$ of the ranking for source objects, and its skewness is defined as
follows:

$$(N_k\ \text{skewness}) = \frac{\sum_{i=1}^{\ell}\left(N_k(i) - \mathrm{E}[N_k]\right)^3 / \ell}{\mathrm{Var}[N_k]^{3/2}},$$

where $\ell$ is the total number of test objects in $\mathcal{Y}$, and $N_k(i)$ is the number of times the $i$th target
object is in the top-$k$ closest target objects of the source objects. A large $N_k$ skewness
value indicates the existence of target objects that frequently appear in the $k$-nearest
neighbor lists of source objects; i.e., the emergence of hubs.


6.2 Task Descriptions and Datasets 

We tested our method on the following ZSL tasks. 


Synthetic Task. To simulate a ZSL task, we need to generate object pairs across two 
spaces in such a way that the configuration of objects is to some extent preserved across the 
spaces, but is not exactly identical. To this end, we first generated 3000-dimensional 
(column) vectors z_i ∈ R^{3000} for i = 1, ..., 10000, whose coordinates were generated 
from an i.i.d. univariate standard normal distribution. Vectors z_i were treated as latent 
variables, in the sense that they were not directly observable, but only their images x_i 
and y_i in two different feature spaces were. These images were obtained via different 
random projections, i.e., x_i = R_X z_i and y_i = R_Y z_i, where R_X, R_Y ∈ R^{300×3000} are 
random matrices whose elements were sampled from the uniform distribution over [-1, 1]. 
Because random projections preserve the length and the angle of vectors in the original 
space with high probability [?], the configuration of the projected objects is expected 
to be similar (but different) across the two spaces. 

Finally, we randomly divided object pairs {(x_i, y_i)}_{i=1}^{10000} into the training set (8000 
pairs) and the test set (remaining 2000 pairs). 
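
The generation procedure above can be sketched as follows. This is a minimal sketch, not the original script: the function name, the seed, and the use of NumPy's random generator are our assumptions, while the default sizes follow the description above. The usage example runs at a reduced size, since the defaults allocate a 3000 × 10000 latent matrix.

```python
import numpy as np

def make_synthetic_pairs(n=10000, latent_dim=3000, feat_dim=300,
                         n_train=8000, seed=0):
    """Generate paired objects (x_i, y_i) as two random projections of z_i."""
    rng = np.random.default_rng(seed)
    # Latent vectors z_i (columns) with i.i.d. standard normal coordinates.
    Z = rng.standard_normal((latent_dim, n))
    # Two independent random matrices with entries uniform over [-1, 1].
    R_x = rng.uniform(-1.0, 1.0, (feat_dim, latent_dim))
    R_y = rng.uniform(-1.0, 1.0, (feat_dim, latent_dim))
    X, Y = R_x @ Z, R_y @ Z
    # Random division into training and test pairs.
    perm = rng.permutation(n)
    train, test = perm[:n_train], perm[n_train:]
    return (X[:, train], Y[:, train]), (X[:, test], Y[:, test])

# Reduced sizes for a quick run (illustrative values only).
(X_tr, Y_tr), (X_te, Y_te) = make_synthetic_pairs(
    n=500, latent_dim=200, feat_dim=20, n_train=400)
print(X_tr.shape, Y_te.shape)
```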


Bilingual Lexicon Extraction. Our first real ZSL task is bilingual lexicon extraction 
[?], formulated as a ranking task: given a word in the source language, the goal 
is to rank its gold translation (the one listed in an existing bilingual lexicon as the 
translation of the source word) higher than other non-translation candidate words. 

In this experiment, we evaluated the performance in the tasks of finding the English 
translations of words in the following source languages: Czech (cs), German (de), 
French (fr), Russian (ru), Japanese (ja), and Hindi (hi). Thus, in our setting, each of 
these six languages was used as X in turn, whereas English was the target language 
Y throughout.⁷ 

⁷ We also conducted experiments with English as X and other languages as Y. The results are 
not presented here due to lack of space, but the same trend was observed. 

Following related work [7,18,20], we trained a CBOW model [?] on the pre-processed 
Wikipedia corpus distributed by the Polyglot project⁸ (see [?] for corpus statistics), 
using the word2vec⁹ tool. The window size parameter of word2vec was set to 10, with the 
dimensionality of feature vectors set to 500. 

To learn the projection function and measure the accuracy on the test set, we used 
the bilingual dictionaries¹⁰ of Acs et al. [?] as the gold translation pairs. These gold 
pairs were randomly split into the training set (80% of all pairs) and the test set 
(20%). We repeated the experiments on four different random splits, for which we report 
the average performance. 
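
The repeated-split protocol above can be sketched as follows. This is a minimal sketch with hypothetical names: `random_splits` and `average_over_splits` are ours, and `evaluate` stands for any user-supplied function that trains on the training indices and returns a score on the test indices.

```python
import numpy as np

def random_splits(n_pairs, n_splits=4, train_frac=0.8, seed=0):
    """Yield (train_idx, test_idx) for independent random 80/20 splits."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_frac * n_pairs))
    for _ in range(n_splits):
        perm = rng.permutation(n_pairs)
        yield perm[:n_train], perm[n_train:]

def average_over_splits(evaluate, n_pairs, **kwargs):
    """Average the score returned by evaluate(train_idx, test_idx)."""
    scores = [evaluate(tr, te) for tr, te in random_splits(n_pairs, **kwargs)]
    return float(np.mean(scores))

# Usage with a placeholder scorer that only reports the test-set fraction.
splits = list(random_splits(1000))
test_fraction = average_over_splits(lambda tr, te: len(te) / 1000, 1000)
```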


Image Labeling. The second real task is image labeling, i.e., the task of finding a 
suitable word label for a given image. Thus, source objects X are the images and target 
objects Y are the word labels. 

We used the Animals with Attributes (AwA) dataset,¹¹ which consists of 30,475 
images of 50 animal classes. For image representation, we used the DeCAF features, 
which are 4096-dimensional vectors constructed with convolutional neural networks 
(CNNs). The DeCAF features are also available from the AwA website. To save computational 
cost, we used random projection to reduce the dimensionality of the DeCAF features to 500. 

As with the bilingual lexicon extraction experiment, label features (word representations) 
were constructed with word2vec, but this time they were trained on the English 
version of Wikipedia (as of March 4, 2015) to cover all AwA labels. Except for the 
corpus, we used the same word2vec parameters as with bilingual lexicon extraction. 

We respected the standard zero-shot setup on AwA provided with the dataset; i.e., 
the training set contained 40 labels, and the test set contained the other 10 labels. 
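
Because the top-1 accuracy for image labeling is macro-averaged over classes (see the footnote on evaluation measures), classes with many test images do not dominate the score. A minimal sketch of this averaging is given below; the function name and the toy labels are ours.

```python
import numpy as np

def macro_top1_accuracy(pred, gold):
    """Top-1 accuracy computed per gold class, then averaged over classes."""
    pred, gold = np.asarray(pred), np.asarray(gold)
    per_class = [np.mean(pred[gold == c] == c) for c in np.unique(gold)]
    return float(np.mean(per_class))

# Class 0 has four test images (all correct); class 1 has two (one correct).
gold = np.array([0, 0, 0, 0, 1, 1])
pred = np.array([0, 0, 0, 0, 1, 0])
print(macro_top1_accuracy(pred, gold))  # (1.0 + 0.5) / 2 = 0.75
print(float(np.mean(pred == gold)))     # micro average: 5/6
```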


6.3 Experimental Results 

Table 1 shows the experimental results. The trends are fairly clear: the proposed 
approach, Ridge_{Y→X}, outperformed the other methods in both MAP and Acc_k over all tasks. 
Ridge_{X→Y} and CCA combined with NICDM performed better than those with Euclidean 
distances, although they still lagged behind the proposed method Ridge_{Y→X} 
even with NICDM. 

The N_k skewness achieved by Ridge_{Y→X} was lower (i.e., better) than that of the 
compared methods, meaning that it effectively suppressed the emergence of hub labels. In 
contrast, Ridge_{X→Y} produced a high skewness, which was in line with its poor 
prediction accuracy. These results support the expectation expressed in our earlier 
discussion. 

The results presented in the tables show that the degree of hubness (N_k skewness) for all tested 
methods inversely correlates with the correctness of the output rankings, which strongly 
suggests that hubness is one major factor affecting the prediction accuracy. 
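
To make the two mapping directions compared above concrete, the following sketch contrasts Ridge_{X→Y} and Ridge_{Y→X} on toy data. It is a minimal sketch under stated assumptions, not the experimental code: it uses the standard closed-form ridge solution and squared Euclidean distances, and the function names, data sizes, and the value of the regularization parameter `lam` are illustrative choices of ours.

```python
import numpy as np

def ridge_map(A, B, lam):
    """Return M minimizing ||M A - B||_F^2 + lam * ||M||_F^2 (columns are objects)."""
    d = A.shape[0]
    # Closed form M = B A^T (A A^T + lam I)^{-1}, computed via a linear solve.
    return np.linalg.solve(A @ A.T + lam * np.eye(d), A @ B.T).T

def sq_dists(P, Q):
    """Squared Euclidean distances between columns of P (rows) and of Q (columns)."""
    return (P ** 2).sum(0)[:, None] + (Q ** 2).sum(0)[None, :] - 2.0 * P.T @ Q

def dist_x_to_y(X_tr, Y_tr, X_te, Y_cand, lam):
    """Ridge_{X->Y}: map test examples into the label space and compare there."""
    return sq_dists(ridge_map(X_tr, Y_tr, lam) @ X_te, Y_cand)

def dist_y_to_x(X_tr, Y_tr, X_te, Y_cand, lam):
    """Ridge_{Y->X}: map candidate labels into the example space and compare there."""
    return sq_dists(X_te, ridge_map(Y_tr, X_tr, lam) @ Y_cand)

# Toy data in the spirit of the synthetic task (sizes are illustrative only).
rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 600))
X = rng.uniform(-1, 1, (20, 100)) @ Z
Y = rng.uniform(-1, 1, (20, 100)) @ Z
X_tr, Y_tr, X_te, Y_te = X[:, :500], Y[:, :500], X[:, 500:], Y[:, 500:]

for name, fn in [("X->Y", dist_x_to_y), ("Y->X", dist_y_to_x)]:
    D = fn(X_tr, Y_tr, X_te, Y_te, lam=1.0)  # (n_test, n_candidates)
    # The gold label of test example i is candidate i.
    acc1 = np.mean(D.argmin(axis=1) == np.arange(D.shape[0]))
    print(f"Ridge {name}: top-1 accuracy = {acc1:.3f}")
```

Both functions return a test-by-candidate distance matrix, so the same ranking and hubness measures can be applied to either direction; only the space in which the nearest neighbor search takes place differs.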

⁸ https://sites.google.com/site/rmyeid/projects/polyglot 
⁹ https://code.google.com/p/word2vec/ 
¹⁰ http://hit.sztaki.hu/resources/dict/bylangpair/wiktionary_2013july/ 
¹¹ http://attributes.kyb.tuebingen.mpg.de/ 

Table 1. Experimental results: MAP is the mean average precision. Acc_k is the accuracy of the 
k-nearest neighbor list. N_k is the skewness of the N_k distribution. A high N_k skewness indicates 
the emergence of hubs (smaller is better). The bold figure indicates the best performer for each 
evaluation criterion. 


(a) Synthetic data. 

method                   MAP   Acc_1  Acc_10  N_1    N_10
Ridge_{X→Y}              21.5  13.8   36.3    24.19  12.75
Ridge_{X→Y} + NICDM      58.2  47.6   78.4    13.71  7.94
Ridge_{Y→X} (proposed)   91.7  87.6   98.3    0.46   1.18
CCA                      78.9  71.6   91.7    12.0   7.56
CCA + NICDM              87.6  82.3   96.5    0.96   2.58


(b) MAP on bilingual lexicon extraction. 

method                   cs    de    fr    ru    ja    hi
Ridge_{X→Y}              1.7   1.0   0.7   0.5   0.9   5.3
Ridge_{X→Y} + NICDM      11.3  7.1   5.9   3.8   10.2  21.4
Ridge_{Y→X} (proposed)   40.8  30.3  46.5  31.1  42.0  40.6
CCA                      24.0  18.1  33.7  21.2  27.3  11.8
CCA + NICDM              30.1  23.4  39.7  26.7  35.3  19.3


(c) Acc_k on bilingual lexicon extraction. 

                         cs             de             fr             ru             ja             hi
method                   Acc_1  Acc_10  Acc_1  Acc_10  Acc_1  Acc_10  Acc_1  Acc_10  Acc_1  Acc_10  Acc_1  Acc_10
Ridge_{X→Y}              0.7    2.8     0.4    1.6     0.3    1.2     0.2    0.8     0.2    1.3     2.9    8.2
Ridge_{X→Y} + NICDM      7.2    17.9    4.3    11.4    3.5    9.8     2.1    6.3     6.1    16.8    14.4   32.6
Ridge_{Y→X} (proposed)   31.5   54.5    21.6   43.0    36.6   58.6    21.9   43.6    31.9   56.3    31.1   55.4
CCA                      17.9   32.7    12.9   25.2    27.0   41.7    15.2   28.8    20.2   37.3    7.4    18.9
CCA + NICDM              21.9   42.3    16.1   33.9    31.1   50.1    18.7   37.0    25.9   48.8    12.4   30.7


(d) N_k skewness on bilingual lexicon extraction. 

                         cs            de            fr            ru            ja            hi
method                   N_1    N_10   N_1    N_10   N_1    N_10   N_1    N_10   N_1    N_10   N_1    N_10
Ridge_{X→Y}              50.29  23.84  43.00  24.37  67.79  35.83  95.05  35.36  62.12  22.78  23.75  10.84
Ridge_{X→Y} + NICDM      41.56  20.38  39.32  20.82  57.18  25.97  89.08  30.70  57.57  21.62  20.33  9.21
Ridge_{Y→X} (proposed)   11.91  10.74  12.49  11.94  2.56   2.77   4.28   4.18   5.15   6.76   10.45  6.14
CCA                      28.00  18.67  36.66  18.98  30.18  15.95  51.92  21.60  37.73  18.27  22.31  8.95
CCA + NICDM              25.00  17.13  32.94  17.65  25.20  14.65  42.61  20.72  34.66  13.16  22.00  8.46


(e) Image labeling. 

method                   MAP   Acc_1  N_1
Ridge_{X→Y}              46.0  22.6   2.61
Ridge_{X→Y} + NICDM      54.2  34.5   2.17
Ridge_{Y→X} (proposed)   62.5  41.3   0.08
CCA                      26.1  9.2    2.00
CCA + NICDM              26.9  9.3    2.42


For the AwA image dataset, Akata et al. [?, the fourth row (CNN) and second 
column (φ^w) of Table 2] reported a 39.7% Acc_1 score, using image representations 
trained with CNNs, and 100-dimensional word representations trained with word2vec. 
For comparison, we evaluated our proposed approach, Ridge_{Y→X}, in a similar setting: 
we used the DeCAF features (which were also trained with CNNs) without random 
projection as the image representation, and 100-dimensional word2vec word vectors. In 
this setup, Ridge_{Y→X} achieved a 40.0% Acc_1 score. Although the experimental setups 
are not exactly identical and thus the results are not directly comparable, this suggests 
that even linear ridge regression can potentially perform as well as more recent methods, 
such as Akata et al.'s, simply by exchanging the observation and response variables. 


7 Conclusion 

This paper has presented our formulation of ZSL as a regression problem of finding a 
mapping from the target space to the source space, which opposes the way in which 
regression has been applied to ZSL to date. Assuming a simple model in which data 
follows a multivariate normal distribution, we provided an explanation as to why the 
proposed direction is preferable, in terms of the emergence of hubs in the subsequent 
nearest neighbor search step. The experimental results showed that the proposed 
approach outperforms the existing regression-based and CCA-based approaches to ZSL. 

Future research topics include: (i) extending our analysis to cover multimodal 
data distributions, or other similarity/distance measures such as cosine; (ii) 
investigating the influence of mapping directions in other regression-based ZSL methods, 
including neural networks; and (iii) investigating the emergence of hubs in CCA. 